Reading and Understanding the Data

Data Cleaning and Manipulation

Dealing with data imbalance (categorical variables of type object): removing object-typed categorical features in which a single value accounts for more than 80% of the rows.
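The 80% rule above can be sketched as follows. This is a minimal illustration, assuming the data lives in a pandas DataFrame; the helper name dominant_object_columns, the toy frame, and its column names are hypothetical and not taken from the notebook.

```python
import pandas as pd


def dominant_object_columns(df: pd.DataFrame, threshold: float = 0.80) -> list:
    """Return object-typed columns whose most frequent value covers more than `threshold` of rows."""
    skewed = []
    for col in df.select_dtypes(include="object").columns:
        # value_counts sorts by frequency, so the first entry is the dominant value's share
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share > threshold:
            skewed.append(col)
    return skewed


# Toy data (hypothetical): street_type is 90% one value, lot_shape is split 50/50
toy = pd.DataFrame({
    "street_type": ["paved"] * 9 + ["gravel"],
    "lot_shape": ["regular"] * 5 + ["irregular"] * 5,
    "lot_area": list(range(10)),
}).astype({"street_type": object, "lot_shape": object})

to_drop = dominant_object_columns(toy)
reduced = toy.drop(columns=to_drop)
```

Here only street_type crosses the threshold, so it is the only column dropped; the numeric column is ignored by this helper.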

Analysing object type categorical variables

Dealing with data imbalance (all numeric data):

Derived features:

Dealing with data imbalance (numeric variables): removing numeric attributes in which a single value accounts for more than 80% of the rows.

Analysis of numeric categorical variables

Outlier Treatment
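The notebook does not state which outlier rule it applies. One common option, shown here purely as an assumption, is to drop rows that fall outside 1.5 times the interquartile range; the helper name, the multiplier k, and the toy column are hypothetical.

```python
import pandas as pd


def drop_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[df[column].between(lower, upper)]


# Toy data (hypothetical): one extreme value among otherwise similar areas
toy = pd.DataFrame({"area": [8000, 8500, 9000, 9500, 10000, 10500, 11000, 200000]})
trimmed = drop_iqr_outliers(toy, "area")
```

Capping values at the bounds instead of dropping rows is an equally plausible alternative when the dataset is small.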

Data Visualization

Visualising the Target Variable: SalePrice

Data Preparation

Using the data dictionary to convert the categorical variables into numeric variables
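A data-dictionary conversion of this kind usually amounts to mapping ordered category codes to integers. The sketch below assumes a pandas DataFrame; the column name quality and the code-to-rank mapping are hypothetical stand-ins for whatever the actual data dictionary defines.

```python
import pandas as pd

# Hypothetical ordinal codes, ordered from worst to best
quality_order = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

homes = pd.DataFrame({"quality": ["TA", "Gd", "Ex", "Fa"]})

# Replace each code with its rank; codes missing from the mapping would become NaN
homes["quality"] = homes["quality"].map(quality_order)

# Guard against silent gaps in the mapping
has_unmapped = bool(homes["quality"].isna().any())
```

Checking for NaN after the mapping is a cheap way to catch category codes that the dictionary does not cover.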

Train Test Split

Feature Scaling
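The split and scaling steps can be sketched together, since the scaler should be fitted on the training portion only. This assumes scikit-learn; the 70/30 ratio, the random_state, the synthetic data, and the choice of StandardScaler (rather than, say, MinMaxScaler) are illustrative assumptions, not settings confirmed by the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): 100 rows, 3 numeric features
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=100)

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42
)

# Fit the scaler on the training rows only, then reuse its statistics on the test rows
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting on the training rows alone keeps information about the test set from leaking into the model.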

Recursive Feature Elimination
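Recursive feature elimination repeatedly fits an estimator and discards the weakest feature until a target count remains. The sketch below uses scikit-learn's RFE with a plain LinearRegression; the estimator, the number of features to keep, and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): only columns 0 and 3 drive the target
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = 5.0 * X[:, 0] - 4.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

# Eliminate one feature per round until two remain
selector = RFE(estimator=LinearRegression(), n_features_to_select=2)
selector.fit(X, y)

# Indices of the features RFE retained
selected = np.flatnonzero(selector.support_)
```

With a DataFrame, X.columns[selector.support_] gives the retained column names directly.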

Model Building and Evaluation

Ridge Regression

Model Prediction and Evaluation Metrics:
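A minimal sketch of fitting a ridge model and computing train and test metrics, assuming scikit-learn. The alpha value and the synthetic data are placeholders; the variable names y_pred_train_r and y_pred_test_r follow the plot legends in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical)
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42
)

# alpha=1.0 is a placeholder, not the notebook's tuned value
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

y_pred_train_r = ridge.predict(X_train)
y_pred_test_r = ridge.predict(X_test)

metrics = {
    "r2_train": r2_score(y_train, y_pred_train_r),
    "r2_test": r2_score(y_test, y_pred_test_r),
    "rmse_train": np.sqrt(mean_squared_error(y_train, y_pred_train_r)),
    "rmse_test": np.sqrt(mean_squared_error(y_test, y_pred_test_r)),
}
```

A large gap between the train and test values would point to overfitting, which is what the regularisation is meant to limit.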

The chart above shows the top 10 predictors from the Ridge Regression model, that is, the features most significant in predicting the sale price of a house.
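One plausible way to obtain such a ranking is to sort the fitted coefficients by absolute value, which is meaningful only when the features were scaled beforehand. The helper name top_predictors, the feature names, and the coefficient values below are hypothetical.

```python
import pandas as pd


def top_predictors(feature_names, coefficients, n=10):
    """Return the n coefficients with the largest absolute value, signs preserved."""
    coef = pd.Series(coefficients, index=feature_names, dtype=float)
    order = coef.abs().sort_values(ascending=False).index
    return coef.loc[order].head(n)


# Toy coefficients (hypothetical), standing in for a fitted model's coef_ attribute
ranked = top_predictors(["f1", "f2", "f3", "f4"], [0.2, -1.5, 0.0, 0.9], n=3)
```

For a fitted scikit-learn model, the call would look like top_predictors(X_train.columns, ridge.coef_), assuming X_train is a DataFrame.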

Residual Analysis of Model

Checking the distribution of the error terms. They should be normally distributed, since this is one of the major assumptions of linear regression.

Error terms seem to be approximately normally distributed with mean 0, so our assumption holds true.
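Alongside a histogram, the same check can be made numerically. The sketch below computes the mean, skewness, and excess kurtosis of the training residuals, all of which should be near zero for a roughly normal distribution; the data and the alpha value are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic training data (hypothetical) with normally distributed noise
rng = np.random.default_rng(3)
X_train = rng.normal(size=(500, 4))
y_train = X_train @ np.array([1.5, -2.0, 0.0, 0.7]) + rng.normal(scale=0.3, size=500)

ridge = Ridge(alpha=1.0).fit(X_train, y_train)
residuals = y_train - ridge.predict(X_train)

# Standardise the residuals, then take the third and fourth moments
mean_resid = residuals.mean()
z = (residuals - mean_resid) / residuals.std()
skewness = np.mean(z ** 3)
excess_kurtosis = np.mean(z ** 4) - 3.0
```

Because the model fits an intercept, the mean training residual is zero by construction, so skewness and kurtosis carry the real information here.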

Blue: Predicted (y_pred_train_r); Red: Actual (y_train)

The residuals are scattered around the line y = 0 with no apparent pattern, which suggests they are independent of each other.

Blue: Predicted (y_pred_test_r); Red: Actual (y_test)

The residuals are scattered around the line y = 0 with no apparent pattern, which suggests they are independent of each other.

Lasso Regression:
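Unlike ridge, lasso can shrink coefficients exactly to zero, so it doubles as a feature selector. A minimal sketch assuming scikit-learn follows; the alpha value and the synthetic data are placeholders rather than the notebook's settings.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (hypothetical): only columns 0 and 3 carry signal
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 6))
true_coef = np.array([4.0, 0.0, 0.0, -3.0, 0.0, 0.0])
y = X @ true_coef + rng.normal(scale=0.5, size=300)

# alpha=0.2 is a placeholder, not a tuned value
lasso = Lasso(alpha=0.2)
lasso.fit(X, y)

# Indices of the features whose coefficients were not driven to zero
kept = np.flatnonzero(lasso.coef_)
```

The surviving coefficients are somewhat smaller in magnitude than the true ones, which is the price of the L1 penalty.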

Residual Analysis of Model

Checking the distribution of the error terms. They should be normally distributed, since this is one of the major assumptions of linear regression.

Blue: Predicted (y_pred_train_l); Red: Actual (y_train)

The residuals are scattered around the line y = 0 with no apparent pattern, which suggests they are independent of each other.

Error terms seem to be approximately normally distributed with mean 0, so our assumption holds true.

Blue: Predicted (y_pred_test_l); Red: Actual (y_test)

The residuals are scattered around the line y = 0 with no apparent pattern, which suggests they are independent of each other.

Changes to the model when we double the value of alpha for both ridge and lasso regression
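The effect of doubling alpha can be sketched for both models at once by comparing coefficient sizes before and after. The base alpha values and the synthetic data are assumptions; the notebook's tuned values are not shown here.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (hypothetical)
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 8))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.2, 0.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=200)

# Placeholder base alphas, standing in for the tuned values
alpha_ridge, alpha_lasso = 10.0, 0.1


def fit_coefs(model_cls, alpha):
    """Fit the given linear model class at `alpha` and return its coefficients."""
    return model_cls(alpha=alpha).fit(X, y).coef_


ridge_base = fit_coefs(Ridge, alpha_ridge)
ridge_double = fit_coefs(Ridge, 2 * alpha_ridge)
lasso_base = fit_coefs(Lasso, alpha_lasso)
lasso_double = fit_coefs(Lasso, 2 * alpha_lasso)

summary = {
    "ridge_l2_base": np.linalg.norm(ridge_base),
    "ridge_l2_double": np.linalg.norm(ridge_double),
    "lasso_l1_base": np.abs(lasso_base).sum(),
    "lasso_l1_double": np.abs(lasso_double).sum(),
    "lasso_nonzero_base": int(np.count_nonzero(lasso_base)),
    "lasso_nonzero_double": int(np.count_nonzero(lasso_double)),
}
```

A larger alpha shrinks the ridge coefficients further without zeroing them, while lasso typically drops more features as alpha grows. Comparing train and test scores at both settings shows whether the extra shrinkage costs accuracy.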

Ridge Regression

Lasso Regression